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ABSTRACT 


An  algorithm  to  evaluate  queries  is  presented  and  it  will  be 
argued  that  the  algorithm  is  minimal  with  respect  to  the  number 
of  relational  accesses  and  with  respect  to  the  merging  of 
inverted  lists.  The  algorithm's  unique  quality  is  its 

efficiency  in  evaluating  partially  inverted  relations.  A  simple 

cost  function  which  can  be  used  to  reflect  the  complexity  of  the 
query  and  to  predict  its  response  time  is  derived  from  the 

algorithm.  The  cost  function  forms  the  basis  of  a  procedure  to 

suboptimize  the  selection  of  the  domains  to  be  inverted.  The 
domains  to  be  inverted  are  selected  by  analyzing  a  typical  set  of 
queries  submitted  to  the  system.  Such  a  method  does  away  with 
usual  methods  of  updating  usage  counters  for  every  domain  and 
relation  in  the  system.  In  this  approach,  the  selection  of  a 
good  set  of  inverted  lists  is  based  on  the  algorithm  which  uses 
those  lists. 
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1.C  Introduction 


The  present  trend  in  data  management  systems  is  to  provide 
powerful  data  manipulation  languages.  Through  these  languages, 
the  user  can  express  queries  involving  multiple  domain/attribute 
qualifications  linked  together  through  such  Boolean  operators  as 
AND,  OR,  NOT,  IS  ONE  OF,  IS  NOT  ONE  OF  etc.  The  system  must  be 
able  to  resolve  the  qualification  part  of  each  query  efficiently. 
However,  the  efficiency  of  a  system  in  resolving  the 
qualification  part  of  queries  is  a  subjective  criteria  and  often 
hard  to  measure  and  to  guarantee  for  a  large  variety  of  queries 
submitted  to  the  system. 

This  paper  considers  two  problems  relevant  to  the  efficiency 
in  resolving  queries  in  relational  data  bases[  1  ]: 

•  The  algorithm  used  to  evaluate  the  qualification  part  of  a 
query. 

•  The  selection  of  the  domains/attributes  to  be  inverted  in 
order  to  enhance  the  speed  of  this  algorithm  while 
minimizing  the  number  of  items  inverted. 

An  algorithm  to  evaluate  queries  is  presented  and  it  will  be 
argued  that  the  algorithm  is  minimal  with  respect  to  the  number 
of  relational  accesses  and  with  respect  to  the  merging  of 
inverted  lists.  The  algorithm’s  unique  quality  is  it’s 
efficiency  in  evaluating  partially  inverted  relations.  A  simple 
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cost  function  which  can  be  used  to  reflect  the  complexity  of  the 
query  and  to  predict  its  response  time  is  derived  from  the 
algorithm.  The  cost  function  forms  the  basis  of  a  procedure  to 
suboptimize  the  selection  of  the  domains  to  be  inverted.  The 
domains  to  be  inverted  are  selected  by  analyzing  a  typical  set  of 
queries  submitted  to  the  system.  Such  a  method  does  away  with 
usual  methods  of  updating  usage  counters  for  every  domain  and 
relation  in  the  system.  In  this  approach,  the  selection  of  a 
good  set  of  inverted  lists  is  based  on  the  algorithm  which  uses 
those  lists.  Ignoring  ^he  algorithm  has  been  the  flaw  in  many 
previously  suggested  selection  procedures. 

Several  experiments  were  madf^  to  test  the  algorithm  and  the 
query  analysis  procedure.  The  results  of  these  experiments  were 
plotted  and  are  included  in  section  4. 
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2.0  Preliminaries 


2.1  Data  Model 


The  model  used  for  the  data  is  the  relational  mod^l  as 
described  by  C'odd[  1  ].  h  relation  is  defined  as  a  subset  of  the 
cartesian  product  of  some  of  the  domains  in  a  data  base.  A 


domain  is 

a  set  of  homogene 

ous  da 

ta  va 

lues . 

For 

example,  in  a 

personnel 

data  base  we 

find 

the 

doma ins 

of 

decimal  valued 

salaries. 

charac^-er  string 

addr*="E 

ses. 

date 

of 

last  promotion. 

etc. 

A  relation  can  be  viewed  as  a  table.  '^he  name  of  the  table 
corresponds  to  the  name  of  the  relation.  The  column  headings  are 
attribute  names  and  the  values  tha^  can  be  used  in  the  column  are 
derived  from  a  domain.  Each  row  is  called  a  tuple  because  it  is 
made  up  of  an  n-tuple  of  values  --  one  for  each  attribute  -- 
which  describes  a  sirxgle  entity  of  a  relation.  A  relational  data 
base  is  composed  of  a  set  of  varying  relations  inter-related 
through  common  domains. 

We  will  consider  each  row  to  be  identified  by  a  unique 
integer.  The  row  numbers  of  the  relation  will  be  referred  to  as 
tuple  numbers  in  ^he  remainder  of  •‘•his  paper.  If  the  relation 
were  stored  as  a  file  where  each  tuple  is  a  record  of  that  file, 
then  a  tuple  number  is  all  that  is  needed  to  access  a  row  using 
the  basic  direct  access  method. 
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2,2  Query  Formulation 

A  query  can  be  divided  into  3  parts: 

•  ACTION:  PFTRIFVE 
INSERT 
PEP LACE 
PFLETE 


•  OBJECT:  A  list  of  domains  that  are  to  be  acted  upon.  The 

AC'T'ION  part  specifies  what  is  to  be  done, 

•  QUALIFICATION:  Defines  a  set  of  tuples  that  are  relevant 

to  the  user's  interest.  This  part  is  also 
referred  to  as  the  retrieval  expression  of 
'■he  query. 

The  qualification  part  of  a  query  involves  Boolean 
expressions  of  conditions  specifi‘=d  on  several  attributes.  The 
following  syntax  specifies  the  set  of  qualifications  QUAL  that  we 
will  consider  on  a  single  relation.  The  tuples  that  satis  fy  the 
Boolean  expressions  are  said  to  be  qualified  and  are  to  be  acted 
upon  as  specified  by  the  rest  of  the  query, 

<gual>  ::=  <qual>  OR  <qual> 


l<qual>  AND  <qual> 
1  NO'^'  <qual> 

I (<qual>) 
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1 <condi tion> 

<condition>  : :=  <3ttrihute>  <comparitor>  <value> 

<comparitor>  ::=  =|^1>1>1<J< 

In  a  relational  data  model,  the  evaluation  of  the  retrieval 
expression  over  a  single  relation  is  simply  resolved  by 
evaluating  the  conditions  for  each  tuple.  However,  the  retrieval 
expression  may  span  across  several  relations  and  there  may  be 
existential  or  universal  quantifiers.  In  either  case,  it  is 
necessary  to  decompose  the  expression  into  nested  passes  where 
each  of  these  passes  can  be  evaluated  by  a  single  scan  of  the 
relation.  The  need  for  several  passes  is  a  major  problei  in 
relational  data  base  systems. 

The  usual  solu''-ior.  to  the  above  problem  is  to  have  a 
processor  that  breaks  -t-he  query  in  all  it’s  passes  and  then  calls 
the  Boolean  processor  for  every  pass,  Pothnie  presents  a 
different  approach  using  what  is  called  Primitive  Boolean 
Conditions  (PBC)[7'j.  The  retrieval  expression  is  divided  into 
PRC  Modules  (PBCM)  and  the  evaluation  of  these  can  be  performed 
in  parallel.  It  is  also  suggested  that  the  PBCM*s  be  modified  as 
they  are  evaluated;  the  idea  is  an  attempt  at  exploiting 
information  gained  during  each  iteration  through  the  algorithm  in 
order  to  determine  other  entries  eligible  for  selection,  or  to 
eliminate  entries  from  the  analysis, 

A 
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Th®  algorithm  tha*-  evaluates  the  retrieval  expression  of  a 
query  will  be  called  the  Boolean  Processor.  This  evaluation  is  a 
mapping  called  the  qualification  mapping 

Qf'AP;  QOAL - >  Relation 

Each  p  in  QUAL  is  mapped  into  a  subset  of  the  Relation, 
Hardgrave  defines  a  relation  =  on  the  set  QDAL  [4].  If  p,g  are 
in  QUAL  then  p  =  g  if  g  car  be  obtained  from  p  by  following  the 
substitution  rules  of  Appendix  A.  Actually,  =  is  an  eguivalence 
relation  and  hence  defines  a  partition  QOAL/=  on  the  set  of 
gualif ications.  Let  [p]  be  the  equivalence  class  of  which  p  is  a 
member.  All  QUAL*s  of  some  equivalence  class  are  equivalent  with 
respect  to  the  properties  of  a  Boolean  Algebra  and  should 
retrieve  the  same  set  of  values  from  a  given  data  base.  The 
importance  of  defining  an  eguivalence  relation  over  QUAL  is  the 
hope  that  for  each  equivalence  class  there  exists  a  standard 
expression  which  is  op-^imal  to  process.  Hardgrave  suggests  that 
conventional  minimization  ♦■echniques  (i.e.  Karnaugh  map)  used  in 
finding  the  minimum  form  of  a  logic  switching  function  may  be 
employed  to  reduce  the  qualification  part  of  a  query[4]. 
However,  these  minimization  techniques  will  give  the  minimum 
switching  func’-ion  only  if  the  result  of  the  switching  function 
is  known  (i.e.  it’s  ON-SFT  is  completely  specified) .  For  QUAL*s, 
the  set  of  tuples  satisfying  the  qualification  part  of  a  query  is 
not  known,  meaning  that  the  minimization  techniques  are  not 
always  applicable. 
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To  find  the  standard  expression  of  an  equivalence  class,  the 
original  expression  formulated  by  the  user  must  be  manipulated 
’somehow*  to  produce  the  standard  expression  of  its  equivalence 
class.  Though  such  a  procedure  is  a  necessity  for  comparing 
algorithms  that  evaluate  retrieval  expressions,  it  remains  an 
opea  research  topic,  Hardgrave  suggests  that  products  of  sums 
(i.e.  conjunctive  normal  form)  may  be  the  standard  expressions 
expected;  he  gives  an  algorithm  that  processes  these,  products  of 
sums  and  minimizes  ^ he  amount  of  information  retrieved  from  the 
data  base. 


etr  ieval_by;_Cgn^en  t 


It  is  the  purpose  of  data  base  systems  to  allow  retrieval 
not  only  on  the  primary  key  of  the  file  but  on  any  set  of  domains 
of  the  file.  This  type  of  retrieval,  known  as  content  retrieval, 
can  degrade  system  response  to  an  unacceptable  level.  In  order 
to  enhance  system  response  and  to  make  content  retrieval  more 
viable,  inverted  files  are  coupled  with  the  relation.  The  use  of 
inverted  files  has  long  demonstrated  its  ^=‘f f ectiveness  in  systems 
requiring  arbitrary  Boolean  qualifications. 

Definition  of  Inverted  File 


Let  Dj 
domain  j. 


'***'^L^  values  for 


DEFINITION  2.3.1 
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Vjj  =  {  p  /  1  <  P  <  m  and  ) 

where  m  is  the  number  of  tuples  in  the  relation. 

VjY  is  the  set  of  tuple  numbers  in  a  relation 

whose  tuples  contain  value  v-,-  for  domain  D;  . 

J''  J 

For  domain  i,  there  are  L  such  sets.  Each  V.  is 

J"- 

called  an  *inver-^ed  list’. 

DEFINITION  2.3.2 

''j  =  . 5 

V.  is  known  as  an  ’Inverted  File*  for  domain  j.  Note 
-J 

that  for  each  relation,  the  sets  in  VJ  are  disjoint. 

Fach  tuple  can  have  only  one  value  for  a  domain. 

The  presence  of  an  invert'='d  file  produces  side  effects  on 
the  system.  On  •‘•he  one  hand,  an  inverted  file  Vj  improves  the 
speed  of  evaluation  of  a  query  if  domain  j  is  involved  in  the 
qualification  part  of  ’■hat  query.  That  is,  ’-he  invert^ed  file  of 
a  domain  provides  information  to  the  Boolean  processor  for 
determining  qualified  tuples  wi’-hout  it  having  to  access  the 
relation  itself,  n  all  but  the  most  degenerate  of  cases,  this 
inverted  file  is  smaller  and  easier  to  interpret  than  the 
relation.  On  the  o’-her  hand,  implies  additional  storage  and 
complicated  updates.  Storage  is  needed  for  the  relation  and  for 
the  inverted  file  Vj  ,  The  actual  storage  required  for  each 
inverted  file  is  dependent  on  the  data  structure  and  the  mathod 
employed  to  access  a  specific  inverted  list. 
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The  updating  process  is  complicated  because  whenever  the 
A.CTrON  is  any  of  INSfPT,  PEPLACE,  DELETE  and  the  domain  j  is 
involved  in  the  OBJECT  part  of  the  query,  the  inverted  file  V 
must  reflect  the  change.  Hence,  not  only  the  relevant  tuples  are 
changed  but  the  invert-^d  file  of  every  domain  in  the  OBJECT  part 
are  updated  independent  of  the  fact  that  the  domain  may  not  have 
helped  in  the  qualification  mapping  of  the  query. 

Whether  the  inverted  file  of  a  domain  should  be  included  or 
not,  can  be  decided  only  after  studying  the  improvement  in 
response  time  versus  the  additional  storage  cost  and  updating 
cost  incurred  by  the  presence  of  the  inverted  file.  This 
analysis  is  necessarily  dependent  on  the  algorithm  used  in  the 
Boolean  processor. 

2^U_R  estrict  ions_gn_The_2]ii^iiii£§.li2^_£§:L^ 


In  the  context  of  this  paper,  it  is  important  to  note  that 
though  the  comparators  {#,<,<,>,>}  are  allowed  to  be  present  in 
the  qualification  part  of  the  query,  they  are  replaced  by  a 
sequence  of  operations  involving  only  the  *=•  comparator,  before 
the  retrieval  expression  is  evaluated.  This  is  because  an 
inverted  list  points  to  the  tuples  whose  values  are  equal  to  the 
values  of  the  list. 

The  restriction  that  only  the  comparitor  •=*  for  the 
algorithm  evaluating  the  qualification  part  of  a  query  is  not 
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unusual.  For  ono,  comparators  {<,<,>,?:}  are  normally  for  domains 
of  numeric  type.  They  are  necessary  to  specify  a  range  of 
numeric  values  for  such  domains  as  SALA.RY. 

Whenever  a  condition  involving  any  one  of  the  comparators 
{<,<,>,>}  is  present,  the  action  taken  is  as  follows: 

•  If  the  domain  has  a  corresponding  inverted  file,  then  the 
inverted  lists  af  all  the  values  involv^^d  in  the  specified 
range  are  merged  in  a  final  list. 

•  If  the  domain  is  not  inverted,  the  condition  can  be 
evaluated  as  stated,  leaving  the  comparator  as  is,  (e. g.  D 
>  2000)  . 

As  for  relation  '/*,  the  Boolean  operator  NOT  is  equivalent.  The 
other  operators  IS  ONF  OF  ,  IS  NOT  ONE  OF  and  ALL  OF  are  replaced 
by  a  sequence  of  OR,  a  sequence  of  OP  proceeded  by  a  NOT,  and  by 
a  sequence  of  AND  respectively. 

Hence,  it  is  clear  that  even  if  the  comparitors  or  Boolean 
operators  are  restricted  for  the  algorithm  processing  the 
qualification  part  of  a  query,  it  does  not  limit  the  formulation 
of  the  retrieval  expression  in  any  way. 

2 . 5_Derspecti ve_o f  Thesis 

It  is  thought  that  the  Boolean  processor  or  qualification 
mapping  is  directly  linked  to  the  problem  of  choosing  the  domains 
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to  be 


inverted 


Section  3  presents  an  algorithm  for  the 
qualification  mapping  and  section  U  demonstrates  how  the 
algorithm  is  used  to  choose  the  inverted  domains  when  given  a 
specific  response  tine  as  a  constraint. 
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3.0  r.r.  A.1  gori"^ hiP  for  the  Boolean  Processor 


3^1_\lternative  Strategies  for_ the  Boolean  Processor 

If  the  retrieval  “^^xpression  is  to  contain  conditions  on  anv 
of  the  defined  domains,  the  Boolean  processor  may  adopt  a 
specific  attitude  to  guarantee  expediency  in  the  evaluation  of 
the  query.  Four  possible  strategies  are  listed  here. 

S  tra  teg_^_2 

An  inverted  file  is  included  for  all  k  domains  of  the 
relation.  The  retrieval  expression  is  evaluated  by  considering 
the  inverted  lists  as  sets  and  performing  set  intersection  for 
every  Boolean  operator  AND  and  set  union  for  every  Boolean 
operator  OP.  The  time  required  to  obtain  a  final  list  is  of  the 
order  of  the  product  of  list  lengths  if  the  tuple  numbers  are  not. 
sorted  within  each  list;  if  the  lists  are  sorted  and  merged 
correctly,  the  algorithm  is  of  the  order  of  the  sum  of  the  list 
lengths(c.f.  ShapiroT  P  1)  . 

Such  a  strategy  usually  encounters  an  extensive  amount  of 
data  redundancy  in  the  form  of  inv^^rted  files  and  involves 
excessive  overhead  in  the  updating  procedures  of  the  system. 
These  side  effects  were  explained  in  detail  in  Section  2.3. 
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The 


Strate2l_2 

Inverted  files  are  included  for  the  most  used  domains, 
user  may  be  required  formulate  the  retrieval  expression  in  two 
parts.  One  part  consisting  only  of  inverted  domains  and  the 
other  part  of  only  non-inverted  domains.  The  part  consisting  of 
inverted  domains  is  evaluated  first  producing  a  final  list  the 
same  way  as  in  strateay  one;  the  second  part  is  processed  by 
reading  every  tuple  number  appearing  in  the  final  list  of  part  1 
and  evaluating  all  i-hp  conditions  of  part  2  on  each  one  of  them. 

Pestricting  thp  formulation  of  retrieval  expressions  in  that 
way  forces  the  user  -^o  know  the  data  paths  in  the  data  base  and 
makes  him  responsible  for  the  efficiency  of  the  Boolean 
processor.  Both  of  these  conflict  the  goal  of  ’data 
independence*  (c.f.  Pate  [2]).  In  addition,  some  retrieval 
expressions  cannot  be  divided  in  such  a  way  and  the  user  may  have 
to  submit  two  or  more  queries  storing  temporary  results  after 
each  query. 

This  strategy  attempts  to  save  on  storage  of  inverted  lists 
and  overhead  in  the  updating  procedures  at  the  expense  of  data 
independence  and  user  convenience. 

Strat e2Y_3 

Only  the  most  used  domains  are  inverted,  but  no  restriction 
is  iaposed  on  the  user.  If  the  retrieval  expression  involves 
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conditions  on  inverted  f^nd  non-inverted  domains,  evaluation  is 
done  by  disregarding  the  inverted  files  and  scanning  the  entire 
relation.  This  kind  of  strategy  stems  from  the  lack  of  valid 
algorithms  to  merge  inverted  lis-^s  and  evaluate  conditions  on 
non-inverted  domains.  Nevertheless,  this  strategy  is  inflicted 
with  the  harmful  side  effects  of  inverted  files  and  does  not 
profit  from  the  expediency  that  results  in  using  inverted  files 
for  guery  evaluation. 

5 1  ra  t  egy_4__ 

Only  the  domains  that  can  best  improve  the  evaluation  of  a 
set  of  queries  are  inverted  and  no  restriction  is  imposed  on  the 
user.  Queries  formulated  with  inverted  and  non-inverted  domains 
are  handled  to  make  the  best  use  of  inverted  information.  This 
strategy  must  use  an  algorithm  that  mixes  the  use  of  inverted 
lists  mergers  and  the  evaluation  of  conditions  on  non-inverted 
domains  of  retrieved  tuples. 

algorithm  to  resolve  retrieval  expressions  according  to 
strategy  four  is  presented, 

3.2  A.n  Algorithm  for  Query  '^valuation 

The  algorithm  rf^^^guires  that  a  binary  tree  be  formed  from  the 
retrieval  expressions.  The  substitutions  mentioned  in  Section 
2.4  are  made  as  the  tree  is  built.  As  a  result,  the  tree  has 
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OR,  NOT)  or 


nodes  representing  Boolean  operators  (i.e.  A.ND, 
conditions;  for  this  reason,  the  tree  can  be  called  a  ’Boolean 
tree*.  The  nodes  that  represent  conditions  form  the  leaves  of 
the  tree;  depending  if  the  domain  is  inverted  or  not,  the  leaf  is 
called  an  ’inverted  leaf*  or  a  * non-inverted  leaf*.  The  other 
nodes  are  called  *AND-node*,  *OP-node*,  and  ‘NOT-node’  indicating 
the  Boolean  operator  which  they  represent. 

3.2.1  Cost  Function  for  the  Boolean  Tree 


A.  cost  is  associated  with  every  node  of  the  tree;  this  cost 
is  evaluated  from  the  highest  level  of  the  tree  (i.e.  from  the 
leaves)  and  gradually  to  lower  levels  till  the  root  is  reached. 
By  the  cost  of  a  branch  is  meant  the  cost  of  the  node  pointed  to 
by  the  branch.  The  calculation  of  the  cost  varies  for  each  node. 

'^he  cost  of  a  node  will  be  ei-^her  positive  or  negative.  The 
sign  is  merely  used  as  a  flag  to  outline  the  units  implied. 
Positive  costs  represent  records  read  from  secondary  storage 
while  negative  costs  are  equal  to  the  numhc^r  of  inverted  list 
pointers  to  bo  processed  by  ROOL_PROCESSOR . 

The  cost  of  a  roo-^  node,  i.e.  the  cost  of  a  query  itself,  is 
composed  of  the  following  considerations; 

•  Which  domains  appear  in  the  query 
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•  The  physical  storage  representation  (i.e.  presence  of 
inverted  files  for  the  domains) 

•  The  role  play='d  by  a  domain  with  respect  to  the  others.  A 
domain  that  is  linked  to  another  domain  through  an  OR 
operator  plays  a  different  role  than  if  it  is  linked 
through  an  AND  operator  because,  in  the  second  case,  it 
directs  th<=>  evaluation  of  the  retrieval  expression. 

The  cost  for  each  noie  is  defined  as  follows: 

Leaves : 

For  an  inverted  li=‘af,  the  cost  is  -n  where  n  is  the  length 
of  the  inverted  list  substituted  for  the  condition.  For  a 
non-inverted  leaf,  the  cost  is  +m  where  m  is  the  total 
number  of  tuples  in  the  relation  as  defined  in  Section  2.1. 

AND  node: 

Two  cases 

•  The  cost  of  ^he  right  and  left  branch  is  negative: 

Cost  =  ?1aximum  (Cost  of  left  branch.  Cost  of  right  branch) 

•  Not  both  branches  have  a  negative  cost 

Cost  =  Absolut  e-^^a lue  (Minimum  (Cost  of  left  branch.  Cost 
of  right  branch)) 


OR-n  ode : 

Three  cases 
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•  Both  branches  have  a  negative  cost. 

Cost  =  Maximum  (Cost  of  left  branch  +  Cost  of  Right  branch,  -m) 
where  m  =  file  size 

•  Both  branches  have  a  positive  cost 


Cost  =  Minimum 

(sCost  of  left 

branch  +  Cost 

of  right 

branch,  +m) 

wher 

e  m  =  file  size 

•  The  cost  of  the 

branches  are 

of  different 

sign 

Cost  =  Maximum 

(Cost  of  left 

branch.  Cost 

of  right 

branch) 

NOT-node 


Two  cases 

•  The  cost 

of  the 

branch 

is  posit 

Cost 

=  Cost 

of  the 

branch 

•  The  cost 

of  the 

branch 

is  negat 

Cost 

=  -ro 
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tima  along  expensive  ones.  Thus,  the  cost  function  plays  a  key 
role  in  the  algorithm  that  evaluates  the  retrieval  expression  and 
it  relieves  the  user  from  being  responsible  for  the  efficiency  of 
resolving  his  query. 

Whenever  the  physical  storage  representation  is  changed,  the 
cost  function  is  automatically  updated  because  the  cost 
information  is  an  inherent  part  of  the  schema. 

I*.2i2_ne  scri2tior_of_^l(jorithm_n00L_PR0CFSS0? 

The  Boolean  processor  ROOL_PROCESSOP  consists  of  2  ma'jor 
sections.  First,  a  PRIVFP  which  calls  the  tree  evaluating 
algorithm  and  returns  only  the  tuple  numbers  which  satisfy  the 
retrieval  expression.  The  second  section  is  the  recursive 
algorithm  '^FFE_FVAL01iTF  tha+-  traverses  the  tree  to  evaluate  the 
retrieval  expression. 

Inverted  lis-»-s  serve  to  ’step  up’  tuple  numbers  that  are 
returned  from  the  tree  evaluating  algorithm  and  as  a  result,  a 
minimum  number  of  tuples  are  accessed  from  secondary  storage. 
But  to  be  able  to  maximize  the  use  of  inverted  lists  in  this  way, 
it  is  necessary  that  the  algorithm  traverse  the  tree  along  the 
branches  having  the  smallest  cost.  To  make  that  decision,  every 
node  must  have  informa-^ion  about  the  cost  of  it’s  branches.  It 
is  not  necessary  to  keep  the  actual  cost  of  the  left  and  right 
branches  of  a  node.  Instead,  as  the  tree  is  built,  a  bit  can  be 


Page  18 


set  to  indicate  which  branch  is  ’cheapest*  or  the  tree  can  be 
built  so  that  all  left  (or  right)  branches  are  ’cheapest*. 

Inverted  lists  must  be  sorted  in  ascending  order  of  tuple 
numbers.  Though  this  results  in  moving  parts  of  the  lists 
whenever  a  pointer  is  added  or  deleted,  benefits  for  searching 
are  considerable  if  the  list  length  is  large.  For  sorted  lists 
the  searching  is  of  the  order  log  n  while  for  unsorted  lists  it 
is  of  the  order  r  where  n  is  the  list  length, 

ii2j,25_1_Data_Structures_f  or_BOOL_PROCESSOR 

The  algorithm  and  associated  data  structures  will  be 
discussed  in  PL/T, 

Global  Variables 

(1)  FILE:  An  array  whose  dimensions  is  determined  by  the 

FILE__SIZE  and  ^he  NO_OF_DOM ATNS  in  the  file.  ^11  the  data 
of  the  relation  is  kept  in  this  array.  Each  row  is  a 
tuple. 

(2)  DOMAIN_TABLF:  A.  structure  describing  each  domain  of  the  FILE. 

(2.1)  NO_OF_VALnFF :  An  integer  equal  to  the  number  of  values 

for  a  domain. 

(2.2)  INVEPTED_LI ST’^PTH :  A  pointer  to  the  based  variable 

INVERTED  LIST. 
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(3)  INVFPTED_LirT:  T-.  based  variable  which  contains  information 

about  every  legal  value  of  a  domain. 

(3.1)  N0_0^_TnPLFS :  An  integer  equal  to  the  number  of 

tuples  in  the  FILE  which  have  this 
particular  value. 

(3.2)  LIST_PTP;  A  pointer  variable  to  the  based 

variable  LIST. 

(4)  LIST:  A  based  variable  that  represents  an  array  whose  entries 

are  tuple  numbers  having  a  specific  value  for  a  domain. 

The  relationship  between  these  structures  is  shown  in  Fig. 
3.1.  A.11  the  domain  names  and  their  associated  values  have  been 

given  integer  values  for  ease  of  processing. 
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File  Structure 
Fig  3.  1 

Required  Procedures: 


READ^^EVA.  (N,  PTR,L OGTC)  :  A  procedure  to  access  a  tuple  number 

specified  by  N  and  evaluate  the  condition  of  the 
non-inverted  leaf  pointed  by  PTR,  LOGIC  returns 
a  true  or  false  depending  if  tuple  N  satisfies 
the  condition  or  not.  This  procedure  is  only 
called  when  the  node  is  a  non-inverted  leaf. 


I 
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Local  Variables 


The  Boolean  tree  is  made  up  of  the  following  types  of  nodes: 

(1)  NODE:  A  structure  containing  information  pertinent  to  an  A.ND- 

node  or  an  OR-node. 

(1.1)  TYPE:  A.n  integer  variable  equal  to  C  for  an  AND-node 

and  ^  for  an  OR-node. 

(1.2)  CHEAPEST:  A  bit  in d ica ■’-.i ng  which  branch  is  cheapest,  1 

for  the  left  branch,  0  for  the  right  branch 

(1.3)  L:  An  integer  variable  equal  to  the  tuple  number 

returned  from.  TREE^EVALUATE  when  the  left  branch 
was  evaluated, 

(1.4)  L_LOr; :  a  variable  that  indica-’-es  if  the  tuple  number 

specified  by  L  satisfies  the  condition  along  the 
left  branch. 

(1.5)  L_BFCH:  A  pointer  variable  pointing  to  the  left  node. 

(1.6)  R,  F_LOn,  E_RRCH :  These  are  the  same  as  L,  L_LOG, 

L_BFCH  respectively  except  they  apply  to 
the  right  branch, 

(2)  NOT^NODE;  A  structure  for  the  NOT-node, 

(2,1)  TYPE:  An  Integer  variable  equal  to  2, 
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(2.2) 


T,  T__LOG,  T_BPCH:  These  are  the  same  as  L ,  L_LOG, 
L^BPCH  in  meaning. 

Variables  L,  P,  and  T  must  be  initialized  to  zero  when  the 
tree  is  built. 


(3)  I NVERTED_IEAF :  A  Structure  for  Inverted  Leaves. 

(3.1)  TYPE;  An  integer  variable  equal  to  3. 

(3.2)  LIST_INDEX:  An  integer  variable  which  is  an  index  in 

the  LIST  of  tuples;  initially  it  is  set  to  1 . 

(3.3)  TnPLFS_LEFT;  An  integer  variable  whose  value  is  the 

number  of  tuples  left  to  consider  in  LIST. 
Initially  it  is  set  to  NO_OF_TnPLES . 

(3.4)  INV_PTR;  A  pointer  variable  equal  ta  LIST_PTP.  It 

points  to  a  specific  allocation  of  LIST. 

(4)  NON__I'NVERTED_LEAF;  A  structure  for  Non-inverted  Leaves. 

(4.1)  TYPE;  An  integer  variable  equal  to  4, 

(4.2)  IA.ST_PEAD;  The  tuple  number  that  was  last  considered. 

It  has  an  initial  value  of  zero, 

(4.3)  CON__rOMAIN;  A.n  integer  variable  which  can  serve  as  an 

index  in  the  DOMAIN  TABLE. 
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(4,4)  c:ON_v?-LaE:  Av.  integer  variable  equal  to  the  value 
specified  in  the  condition  C!ON_DORi\iN  = 
EON  VA.LUF, 


3j^2j_2  ^2^_Alqor  1 1  hm 

Description  of  DFIVFP 

Initially,  the  DRIVFP  calls  the  tree  evaluation  algorithm 
with  a  tuple  number  oqual  to  1;  this  forces  the  algorithm  to 
consider  the  whol^^  relation.  The  algorithm  will  return  a  number 
at  least  as  gr^^at  as  1  which  is  either  *-he  next  highest  tuple 
number  for  which  the  qualification  is  true  or  the  last  tuple  that 
is  actually  returned  in  order  to  evaluate  the  qualification, 

After  that,  the  PPTVFR  calls  the  algorithm  with  a  tuple  number 
that  is  one  grca"^er  than  the  tuple  number  returned  by  the  tree 
algorithm. 

DPIVEF;:  N  =  1 

FTP  =  root  pointer  of  the  query, 

DO_NHILE:  /*  Call  TR EF_E VALHATF  until  either  TUPLE  becomes  equal 

to  the  FrLF_SIZF  or  it  has  a  value  of  -I.  ♦/ 

CALL  TFEF_FVALUATE(N,PTR, TUPLE,  LOC7IC) 

If  the  LOGIC  is  true  store  the  value  of  TUPLE  in  a  list 
TUPLE  =  TUPL^  t  1 
END  of  DO  WHILE 
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END  of  DRIVEE 


Description  of  TRFE_FV?IL  DATE  (N,  PTR  , TUPLE,  LOGIC) 


It  is  recursive 


* 

/ 


The  cos 
Boolean  tree 
cheaper  bran 
larger  porti 
considered  a 
evaluate  one 
number  shou 
expression  is 
Section  3.1, 
lists  first  a 
at  the  very 
the  present  a 
instead,  it 
traverses  as 
satisfies  th 
are  depleted 
that.  If  a 
that  ?^ND-node 
of  the  tree 
not  have  all 
present  aloor 

TREE  EVALUATE 


t 

f  unction 

always  d 

irects 

the  travG 

rsal  of 

the 

along 

the 

branch  to 

the  ch 

eaper  sub 

tree;  t 

hese 

che 

s 

lea  d 

to 

subt  re 

es  whose 

leaves  us 

ually  ha 

ve  a 

on 

of 

i  nv 

erted 

1  ist 

s.  The 

inverted 

lists 

are 

s 

far 

a  s 

poss 

ible. 

until  i 

t  becomes 

necessar 

y  to 

of 

t  he 

non- 

inver 

ted  le 

aves  to 

decide  whe 

ther  a  t 

uple 

Id 

be 

a  cc 

epted 

or 

reject  e  d 

.  If  th 

e  retri 

eval 

f  0 

rmula t  ed 

in  2 

parts 

as  sugg 

ested  by  s 

trategy 

2  of 

then  t 

his  a 

Igori 

thra  wi 

11  go  th 

rough  all 

the  inve 

rted 

nd 

e  va 

lua  t*^ 

the 

conditions  of 

the  second 

part 

only 

e 

nd. 

It 

d  oes 

less  w 

ork  than 

that  stra 

tegy  bee 

ause 

Igorithm  does  not  merge  the  lists  into  a  final  list, 
considers  every  tuple  in  ascending  order  and 
few  nodes  as  possible  to  decide  if  the  tuple  number 
e  retrieval  «=^xpr*=^ssi on .  Inverted  leaves  whose  lists 
return  a  value  of  -1  and  will  not  be  visited  after 
depleted  inv-^rted  leaf  is  a  branch  of  an  AND-node, 
will  no  longer  be  visited.  This  gradual  truncation 
as  the  query  is  (=>valuated  means  that  some  lists  do 
their  tuple  numbers  considered;  that  is  why  the 
ithm  is  better  than  strategy  2. 

requires  the  following  parameters: 
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2  input  parameters 


N:  integer  variable  equal  to  the  smallest  tuple  number 

that  can  be  returnef^.  from  the  algorithm. 

PTR;  A  pointer  variable.  It  is  an  input  parameter  pointing 
to  a  specific  allocation  of  a  node. 

2  output  parameters 

TUPLF:  An  integer  variable  equal  to  the  tuple  number 

returned.  It’s  value  is  at  least  that  of  N,  TUPLE  > 
N.  It  can  also  have  a  value  of  -1  when  a  node 
pointed  to  by  PTF  n'=>ed  not  be  visited  any  longer. 

LOGIC:  A  bit  variable  indicating  if  the  tuple  number  in 

TUPLE  satisfies  the  boolean  subtree  having,  as  root, 
the  node  pointed  by  PTR. 

Note:  We  will  assume  that  the  cheaper  branch  is  always  the  right 

branch . 

TREE_EVALUATF: 

CASF^STATEHFNT  on  node  TYPP 

TYPE  =  0 :  /*  An  AND-nodo 

STEP  0.1;  /*  The  values  of  L  and  R  must  be  increased  to  a 
value  >  N.  If  I,  and/or  R  becomes  equal  to  -1, 
TUPLE  is  set  to  -1  and  a  RETURN  is  made  immediately 
*/ 

Three  situations  arise 
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0.1.1.  L  <  N  and  P  <  N 


Both  are  smaller  than  N.  In  this  case  the 
cheapest  branch  is  followed  and  TP.EE_EVA.LU?irE  is 
recursively  called. 

i.e.  ri\lL  TPEE_EV?^LUATE  (N,P_BRCH,R,  R^LOG) 

0.1.2.  L  >  N  and  -1  <  P  <  N 

The  right  branch  must  be  followed, 
i.e.  CALL  TPFE_EV7iLnATE(L,R_3KCH,R,R_L0G) 

Note  that  the  smallest  tuple  number  that  can  be 
returned  is  that  of  L. 

0.1.3.  Samc^  as  2.  but,  the  left  branch  must  be 
followed.  i.e.  CALL 

TFE?_FV  ALTTATE  (P  ,  L_BPCH,  L,  L^LOG) 

STEP  0.2:  /*  L  and  P  are  both  >  N.  */ 

The  value  of  TPPLE  and  LOGIC  are  set  and  returned. 
Two  situations  arise. 

0.2.1.  L  =  P 

If  both  L__L0G  and  R_LOG  are  true 
then  TUPLE  =  L 

LOGIC  =  true 
else  TUPLE  =  L 

LOGIC  =  false 
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0.2.2.  L  H 


If  f^?iX(L,R)  has  it*s  logical  value  false 


then  THPL^ 
LOGIC 
else  TUPLF 
LOGIC 


WAX  (L  ,R) 
false 

MAX(L,R)  -  1 
false 


PETTJPN 

TYPE  =  1:  /’*'  an  OP-nole  >^7 

STEP  1.1:  The  algorithm  remains  in  this  step  till  L  and  R  > 

N  or  both  L  and  R  =  -1.  If  at  some  time  L  or  P 
become  =  N  and  the  associated  logic  is  true  then 
return  is  made  with  TUPLE  =  N  and  LOGIC  =  true.  If 
P.ETUPn  is  not  made,  the  algorithm  proceeds  to  STEP 
U.  Two  situations  arise  here. 

1.1.1.  L  and  ?  <  N 

The  cheap^^^r  branch  is  followed. 

i.p.  CALL  TFT='E_EVALUATE  (N,P_RRCH,  P,  F_L3GIC) 

1.1.2.  One  of  L  or  R  is  <  N.  Assume  it*s  L, 
i.e.  CALL  TRFE_EVALUATE  (N,  L_BP.CH,  L,  L_L0G) 

STEP  1.2:  /*  TUPLE  and  LOGIC  are  to  be  set  and  returned  */ 

L  =  R  if  L_IOG  is  true  or  R_LOG  is  true 

then  TUPLE  =  L 
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LOGIC  =  true 


else  TrjPLT^  =  L 

LOGIC  =  false 

L  ^  P  then  the  minimum  of  L  and  H  is  returnel.  Of 
course,  if  the  minimum  is  then  the  raaximim  is 
returned  , 

FFTIJPN 

TYPE  =  2:  /*  A  NOT-node  */ 

If  -1  <  N 

then  Cf-LL  TPEE_FVALUATE(N,T_BPCH,T,T_L03) 
else  if  T  =  -1  or  T  >  M 

then  TflPLE  =  N  /*  because  of  this,  the 
LOGIC  =  true  NOT-node  does  not 

else  TOPLE  =  N  step  the  tuple  numbers  */ 

LOGIC  =  -.T_LOG 

RETOPN 

TYPE  =  3:  /*  A.n  inverted  leaf  */ 

LIST_TNPEX  is  incremented  ♦•.ill  the  algorithm  finds  a 
tuple  number  >  N.  In  the  case  that  a  tuple  number  is 
found,  TUPLE  is  set  to  that  number  and  LOGIC  made 
true.  If  LIST  is  depleted,  a  -1  is  returned  for  TUPLE 
and  the  leaf  will  not  be  visited  again 

RETURN 
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TYPE  =  4;  /*  A  n on-in  vertef^  leaf  */ 

Record  N  is  read  if  it  was  not  retrieved  earlier  and  the 
condition  is  evaluated  on  the  tuple. 

RETORN 

END  OF  TREE_EVALO/-TE 

Refer  to  Appendix  3  for  examples  of  TREE  EVALOATE. 

A.n  algorithm  that  evaluates  retrieval  expressions  will  be 
defined  to  be  optimal  if  it  reads  the  least  number  of  tuples  from 
secondary  storage  in  order  to  evaluate  the  Boolean  tree  and 
accesses  as  few  tuple  pointers  from  inverted  lists  as  possible. 
Speed  as  a  criterion  for  optimality  is  difficult  to  consider 
because  a  great  deal  depends  on  the  relationship  between  physical 
storage  structure  and  the  operating  system’s  access  methods. 

The  algorithm  presented  in  section  3,2  is  believed  optimal 
for  a  given  tree.  Because  it  is  not  proven,  we  will  designate 
the  algorithm  as  minimal.  By  rearranging  the  retrieval 
expression  into  the  standard  expression  of  its  equivalence  class, 
as  Hardgrave  sugges^-s,  the  resultant  expression  can  sometimes  be 
evaluated  faster.  But,  since  it  is  not  known  what  the  standard 
expression  is,  optimality  of  the  algorithm  is  argued  with  respect 
to  retrieval  expressions  as  formulated  originally.  This  is  not 
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considered  a  weakness  in  the  argument  because  if  any  algorithms 
do  exist  to  rearrange  the  expression  in  such  a  standard  form, 
they  could  be  used  as  a  preprocessor  for  the  retrieval  algorithm. 
The  algorithm  presented  would  still  be  minimal  for  these  standard 
expressions. 

Access  to  sGcontlary  storage  can  best  be  minimized  if  an 
algorithm  can  form  a  final  list,  as  that  of  strategy  2.  The 
algorithm  ROOL^PPOCFSSOP,  does  exactly  that  by  traversing  the 
tree  as  it  does. 

What  assurance  is  there  that  a  tuple  need  not  be  read  more 
than  once?  This  question  has  been  answered  in  2  ways.  First, 
the  procedure  EE?.D_fva.  keeps  in  core  the  last  K  tuples  read,  and 
second,  false  valued  tuples  are  returned,  from  the  nodes.  The 
actual  value  selected  for  K  depends  on  the  amount  of  core 
available.  Return,  from  a  node,  could  be  made  either  when  a  true 
valued  tuple  number  is  found  or  the  node  is  truncated.  If  that 
approach  is  adopted,  tuples  may  be  read  twice;  consider  Fig  3.2. 
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A  Potential  Pathology 
Fig  3.2 

In  Fig  3,2,  if  C0NP1  is  false  for  all  tuples  1,2,3,...,K 
then  K+1  is  the  firs*-,  tuple  returned  from  the  AND-node  on  the 
left.  When  the  A.Nn-node  on  the  right  is  evaluated,  tuples 
1 ,2 , 3 , . . , , K+ 1  are  all  read  a  second  time.  However,  if  false 
valued  tuples  arp  returned,  all  the  tuples  are  read  only  once, 

3 . U , Addit ional  Comments  on  The  Hse  of  BOOL  PROCFSSOR 

When  an  algorithm  is  made  faster,  it  is  the  usual  course  to 
sacrifice  core  storage;  so  is  it  in  this  case.  The  node 
information  is  larger  and  all  the  inverted  lists  needed  for  a 
query  must  be  present  in  core.  Though  most  algorithms  that  merge 
inverted  lists  require  only  2  lists  to  be  core  resident,  those 
same  algorithms  are  of  the  order  of  the  product  of  their  list 
lengths. 
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Shapiro  presents  an  algorithm  that  merges  completely 
inverted  relations  in  a  similar  fashion  as  TREE_EVA.LUA.TE  and 
points  out  that  algorithms  merging  inverted  lists  by  considering 
all  of  them  at  orce  are  of  the  order  of  the  sum  of  the  lists 
lengths[25].  For  these  algorithms,  the  inverted  lists  must  be 
sorted  in  ascending  order  of  the  tuple  numbers. 

It  is  worth  noting  that  inverted  lists  are  usually  blocked. 
Space  can  be  economized  by  keeping  only  one  block  of  the  needed 
inverted  lists,  in  memory,  at  any  one  time.  The  modifications  to 
TREE_EVALnATE  are  very  minor  and  involve  changing  the  part  of  the 
algorithm  that  rettirrs  a  pointer  from  an  inverted  leaf  (i.e. 
TYPE  =3).  It  is  now  necessary  to  know  if  there  are  any  more 
blocks  in  the  inverted  list  and  if  so,  a  second  block  is  read  as 
soon  as  the  first  is  depleted,  and  so  on  for  all  the  others.  The 
modified  algorithm  is  still  of  the  same  order  but  core  spaca  can 
be  put  to  better  use.  Because  of  the  functioning  of  the 
algorithm,  it  may  happen  that  only  one  block  need  be  brought  in 
core.  Consider  Fig  3.3  as  such  a  case.  The  left  branch  of  the 
tree  in  Fig  3.3  is  referenced  only  once  before  the  right  branch 
is  depleted,  truncating  the  ANP-node  from  the  tree. 

A  second  important  consideration  for  an  on-line  environment 
is  to  minimize  I/O  time.  By  requiring  a  separate  I/O  request  to 
access  each  tuple  would  cause  excessive  real-time  required  to 
evaluate  the  Boolean  expression.  Blocking  tuples  should  be 
considered  as  a  way  to  minimize  T/0  requests.  This  requires  a 
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versus 


study  of  the  locality  of  the  references  of 
another  and  will  not  be  considered  further  in 


one  tuple 
this  paper. 


Only  One  Block  of  The  Inverted  List  is  Read 

Fig  3.1 

3.5  Concluding  Remarks  about  '^h^  BOOL  PROCESSOR 

There  are  five  important  favourable  points  to  be  made  about 
the  algortithm  TP FE_EV.z\LUATF, 

•  A  simple  costing  function  is  used  to  select  the  cheapest  path 

to  follow  in  traversing  the  Boolean  tree.  At  the  same  time, 
an  accurate  cost  can  be  assigned  to  the  whole  guery  and 
reported  to  the  user  before  a  retrieval  is  made. 

•  Minimization  of  accesses  to  secondary  storage  combined  with 

tree  truncation  caused  by  the  depletion  of  inverted  leaves 
produce  a  minimal  algorithm. 

•  Though  more  information  is  required  at  each  node,  the  test  made 

usually  involves  only  integer  comparisons  between  tuple 
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numbers  instead  of  the  more  expensive  character  string 
com  parisons. 

•  condition  on  a  non-inverted  domain  can  be  included  in  the 

retrieval  expression  without  significant  increases  in 
response  time  if  a  reasonable  set  of  inverted  lists  are 
present.  Experimental  results  will  be  presented  in  Chapter 
4  expanding  on  this  point. 


•  In  the  worst  cases  or  the  best  cases,  the  algorithm  does  not 
increase  the  amount  of  work  or  overhead  required  to  resolve 
the  query.  If  the  Poolean  tree  contains  only  inverted 


leaves,  the  inverted 

lists 

are  merged  in  an  optimal 

way.  If 

the 

tree  contains 

only 

non-inverted  leaves,  the 

relation 

will 

be  scanned  by  r 

eading 

each  tuple  only  once. 
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4.0  The  S'^lection  of  Indices 


The  problem  of  selecting  a  good  set  of  inverted  lists  has 
been  studied  by  King  and  Lun  separately;  both  of  them  took  a 
probabilistic  attitude  toward  the  problein[5,6  ].  However,  one 
important  point  was  left  out  of  their  analysis;  tha  *form*  of  the 
retrieval  expression  is  relevant  to  the  problem  of  seleoting 
indices.  That  is,  the  type  of  Boolean  operator  qualifying  a 
domain,  the  position  of  a  domain  with  respect  to  another  in  the 
Boolean  tree ,  and  the  level  at  which  a  domain  appears  in  the 
Boolean  tree  must  also  be  considered. 

Also,  the  raethol  of  selecting  domains  to  be  indexed,  depends 
on  the  Boolean  processor’s  use  of  the  indices  in  evaluating  the 
rf=‘trieval  expression.  Some  exp^-riments  were  carried  out  on  three 
different  relations  and  the  results  are  presented. 

4_j_2_^_Wor  king_MGthod_f  gr_Selecting._Indices 

In  view  of  the  difficulties  in  using  the  other  methods,  it 
was  decided  to  study  the  optimizing  problem  of  selecting  indices 
via  a  different  approach.  "^he  idea  was  to  develop  a  query 
analysis  procedure  that  looks  at  a  given  set  of  sample  queries 
and,  based  on  structural  information  obtained  from  the  retrieval 
expressions  of  the  queries,  select  a  good  set  of  domains  as 
indices  for  a  given  relation. 
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The  study  of  the  method  has  been  limited  to  a  retrieval 
system  only.  However,  the  purpose  of  restricting  the  number  of 
inverted  lists  is  to  make  updating,  on  the  average,  easier. 


The  method  to  be  presented  does  not  require  that  a  usage 
monitor  be  included  in  the  data  management  system.  Instead, 
sample  queries  submitted  to  the  system  are  added  to  a  ’query 


file*.  When  performance  drops  an 
believes  file  reorganization  i 
program  inputs  all  the  queries  fr 
which  domains  should  be  inverted, 
extended  to  the  previously  descri 
same  obvious  advantages  over  a 
no  extra  burden  for  most  systems 
kept  for  error  recovery  purposes. 


d  the  data  base  administrator 
s  necessary,  the  query  analysis 
om  the  query  file  and  decides 
The  use  of  a  query  file  can  be 
bed  methods  and  would  give  the 
monitor.  This  file  is  actually 
because  query  journals  must  be 


The  structural  information  that  was  considered 
the  selection  of  indices,  is  the  following: 


The  level  at  which  a  domain  appears  in  the 
because,  for  example,  a  domain  appearing  at  a 
the  tree  and  linked  up  to  an  AND-node  has  a 
on  the  query. 


important  to 

Boolean  tree 
low  level  in 
major  effect 


•  The  joint  probability  of  the  domains  to  appear  in  the  same 
retrieval  expression, 

•  The  type  of  Boolean  operator  that  is  used  with  a  domain. 
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•  The  amount  of  information  contained  in  the  inverted  lists 
of  a  domain  (i.e.  The  number  of  values  in  the  domain). 

The  method  makes  use  of  the  cost  function  defined  in  section 
3.2.1.  Recall  that  this  cost  function  measures  the  work  required 
to  evaluate  the  retrieval  expression.  It  is  also  a  measure  of 
the  influence  of  the  various  leave's  in  the  boolean  tree  and  of 
their  relationship  to  one  another  through  sharing  A.ND-nodes,  OR- 
nodes,  NOT-nodes,  and  subtrees.  The  selection  of  the  domains  to 
be  inverted  is  made  by  choosing  that  combination  of  domains  that 
give  the  minimum  average  cost  for  the  queries  residing  in  the 
query  file.  The  method  is  described  more  fully  below. 

The  procedure  gives  the  data  base  administrator  the  best 
selection  of  1  domain,  2  domains,  3  domains  and  so  on,  up  to  k 
domains.  Which  one  of  th^^se  is  chosen  depends  on  the  constraints 
put  on  the  system.  If  the  constraint  on  response  time  is  met 
after  only  a  fraction  of  the  k  domains  are  inverted,  then 
inverted  files  will  be  formed  or  only  this  proper  subset  of  the  k 
domains.  The  tests  performed  on  the  procedure  are  explained  in 
secti on  4.2, 

Query  Analysis  Procedure 
V  ariables : 

COMBINATION:  An  integer  array  of  k  entries.  This  array  is  used 

to  store  the  selected  domains. 
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I NVERTED_DOMAlIN;  An  integer  variable  equal  to  the  numbsr  of 

domains  that  have  been  selected. 

TPY^DOMAIM:  An  integer  variable  equal  to  the  domain  number  used 

to  fill  the  (INVFRTED_DOnAIN  t  l)th  entry  of  array 
COMBINATION. 


AVG^COST; 

An 

integ=>r 

variable 

equal  to 

the  ave 

rage  cost  of 

all 

t  h 

e  qu-^ries 

when  the 

doma ins 

invert 

ed  are  in 

the 

(I 

NVFP.TED_POMAIN  +  1) 

first  en 

tries  of 

COMBINATION 

• 

MIN_COST: 

An 

integer 

variable 

equa  1 

to  the 

cost  of  the 

best 

COMBINATION  of  INVFPTED^DOM AIN  t  1  domains. 

BEST_DOMAIN:  An  integer  variable  equal  to  the  domain  number  that 

gives  the  best  combination  when  used  to  fill  the 
INVEP'^FD_T>OMATN  +  1  entry  of  COMBINATION. 

Required  Procedure: 


NODE_COST  (PTP) :  h  procedure  that  returns  the  cost 


•r-l 

o 

t  e 

d  to  by  FTP. 

The 
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of  an 
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T_ 

nverted  leaf  can  be  calculated 

that 
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are  uniformly 
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e  tup 
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s  in  the  file. 

In  such  a  case 
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cost 

of 

an  i 

n verted  leaf 

= 

(nu 
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of 

tuples/number 

of  values  in  Dj 

of  the  node 

by  assuming 
dis  trib  uted 
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Tf  the  data  base  administrator  knows  exactly  how  many  tuples 
have  a  particular  value,  then  the  cost  of  an  inverted  leaf 
can  be  calculated  more  accurately. 

Description  of  Method 

STEP  0;  The  queries  are  input  and  for  each  of  them  a  boolean 

tree  is  built 
TNVEBTED_DOMAIN  =  0 
STEP  1:  If  INVEP'^FD_DnMATN  >  k 

then  ’termination* 
else  MTN_C0ST  =  file  size 
DOMAIN  =  0 


STEP  2: 


/*  The  entry  IN V’=’PTED_D0M AIN  +  1  in  array  COMBINATION 
is  filled  in  turn  by  all  the  domains  not  appearing 
among  the  firs-^  INVERTEr)_DOMAIN  entries  of  that 
array  */ 


TFY 

1 

o 

AIN 

♦  1 

N  > 

k  th 

en  0 

0  TO 

STEP 

IN  i 

s  am 

ong 

■^he 

first 

of 

COM 

BIN  A 

TION) 

0  TO 

STE 

P  2 

STEP  3: 


/*  The  average  cost  of  the  set  of  queries  is 

evaluated  when  IN  VFRTED_DOMAIN  1  domains  are 

inverted.  The  actual  set  of  inverted  domains  is 
specified  by  array  COMBINATION  */ 
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AVG  COST 


0 


For  all  the  oueries  do 

aVG_COST  =  AVG_COST  +  NODF_COST  (root  node  of  query) 
AVG_COST  =  AVG_COST  /  number  of  queries. 

If  AVG_COST  <  riIN_COST 

then  F!IN_OOST  =  AVG^COST 

PFST_,D0f1A  IN  =  TPY^POMAIN 
else  GO  TO  STFP  2 

STFP  4:  /*  The  best  selection  of  IN V^PTED_DOMAIN  +1  domains  is 

known  */ 

PRINT  the  best  combination 
INVFRTFD^DONAIN  =  IN VFRT ED_DOM AIN  +  1 
GO  TO  STEP  1 

END  of  QTTFPY^ANALYSIS  procedure 

By  using  this  query  analysis  procedure,  the  data  base 
administrator  will  know  the  suggested  r  combinations  for  r  =  1  to 
k.  Which  one  of  these  combinations  is  sufficient  to  his 
application  wil3  be  decided  by  determining  the  acceptable 

response  time.  The  cost  of  a  query  is  equal  to  the  number  of 

records  read  while  the  response  time  of  a  query  is  defined  in 

Section  4.2.2  as  -^-he  number  of  records  read  plus  all  those  blocks 
of  the  inverted  lists  that  were  read  because  of  the  presence  of 
inverted  domains  in  the  queries.  Cost  and  response  time  are  thus 
very  closely  related  (This  will  be  demonstrated  in  Section 
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4,2.3).  The  data  base  administrator  must  then  chDose  that 
combination  whose  cost  is  less  than  the  desired  response  time. 


4.2  Experimental  Pesults 

The  aim  of  the  experiments  described  in  this  section  was 
twofold.  The  first  purpose  was  to  test  the  worth  of  the 
algorithm  BO0L_PFO2F S SOR  by  plotting  2  graphs;  inverted  file 
storage  versus  response  time  and  %  of  domains  inverted  versus 
response  time.  From  the  curves,  it  would  be  possible  to  decide 
if  the  algorithm  has  som^  value.  The  second  part  of  the 
experiment  was  to  test  the  reliability  of  the  query  analysis 
procedure  presented  in  Section  4.1, 

4.2.1  Experimental  Pata 

Test  data  was  partially  generated  for  all  the  experiments, 
A.  total  of  three  files  were  produced,  each  of  them  having  a 
corresponding  set  of  queries. 

The  first  file  was  taken  from  the  Gradua“^e  Student  Handbook 
which  contained  information  on  all  the  courses  offered  to 
graduates  in  the  department  of  computer  science  at  the  University 
of  Toronto[3].  The  domains  gave  information  about  the  course 
code,  the  level  of  difficulty  for  the  course,  when  the  course  is 
given  (i.e.  fall  or  spring),  the  field  under  which  the  course  is 


Page  42 


classified,  the  department  offering  the  course,  the  lecturer's 
name  and  finally  the  average  mark  for  the  course  during  the  last 
semester.  The  relation  COORSE  consisted,  in  all,  of  7  domains 
(i.e.  CODE, LEVEL, TIME, FIELD, DEPT, PROF, AVG) .  Before  the  relation 
and  the  accompanying  set  of  queries  could  be  used,  all  the  domain 
names  and  domain  values  were  diqiti?:ed. 

A  program  was  written  to  produce  all  the  other  test  data. 
This  program,  called  a  data  generator,  output  a  relation  of 
specified  size  along  with  a  number  of  queries.  The  values  of 
each  domain  were  assumed  to  be  either  normally  or  randomly 
distributed;  when  normally  distributed,  the  mean  and  varianca  was 
input.  The  queries  were  also  produced  from  statistical 
considerations.  The  Boolean  operations  involved  in  the  query 
were  selected  by  dividing  the  real  number  interval  0  to  1.0  into 
4  different  parts.  The  interval  0,  to  ,10  was  for  the  NOT-node, 
.10  to  .45  for  the  ANP-node,  .45  to  .BO  for  the  OR-node  and  .80 
to  1.0  for  a  leaf.  The  domains  involved  in  the  queries  were 
assumed  to  he  normally  distributed  and  their  selection  in  a  query 
was  done  using  an  arbitrary  mean  and  variance.  The  data 
generator  required  that  -f-he  following  information  be  provided  as 
input; 


”  The  number  of  tuples  in  the  relation  (i.e  file  size), 
-  The  number  of  domains. 


Page  43 


For  each  domain:  the  type  of  distribution,  the  number  of 
values,  the  mean  and  variance  if  the  values  were 
normally  distributed. 

-  The  maximum  depth  of  the  Boolean  tree. 

The  mean  and  variance  of  the  normal  distribution  used  to 
select  the  domains  involved  ir  the  leaves. 

-  The  number  of  qu<=^ries  to  be  produced. 

The  queries  produced  had  to  be  scanned  to  verify  their 
semantic  correctness.  The  r^^^sulting  t^st  data  had  the  following 
c  ha ra  c ter i sties : 

File  2:  90  tuples 
6  domains 

28  queries  of  a  maximum  depth  equal  to  3. 

^ile  3:  98  tuples 
10  domains 

28  Queries  of  maximum  depth  of  3. 

Of  course,  test  dad-a  can  be  questioned  for  it's  validity. 
However  the  use  of  such  a  da'^a  generator  is  an  attempt  at 
producing  objective  test  data. 
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ii-2j.2_Testin3-2£^BOgL_PPgCESSgP 

As  mentioned  in  the  las^t  section  of  Chapter  3,  the 
performance  of  the  algorithm  ROOL__PROCFSSOR  does  not  seem  greatly 
influenced  when  conditions  on  some  non-invertad  domains  are 
present  in  a  retrieval  expression.  It  was  hoped  that  the  graph 
of  the  fraction  of  the  domains  inverted  versus  response  time 
would  be  a  negative  exponential  curve  as  in  Fig  4.1.  Consider 
Fig  4,1,  the  difference  r^^  -  r^  is  very  small  for  a  comparatively 
large  difference  in  d^  -  dj .  This  suggests  that  if  the  x-axis 
represents  response  time  and  the  y-axis  represents  the  percentage 
of  domains  inverted,  then  a  very  small  improvement  in  response 
time  can  be  expected  after  d  %  of  the  domains  are  inverted.  If 
the  testing  of  n00L_PP0CF SSOP  produces  results  as  those  suggested 
above,  then  the  algorithm  would  be  valuable  for  the  evaluation  of 
queries  for  selecting  indices  when  a  relation  is  only  partially 
inverted . 
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Idpal  Fes^ults 
Fig  4. 1 

The  relations  described  in  Section  4.2.1  were  small  enough 
that  it  was  possible  to  evaluate  the  queries  for  as  many  times  as 
there  are  combinations  of  the  number  of  domains  in  the  file. 
Each  combination  represented  the  subset  of  domains  inverted.  The 
results  were  plotted  for  the  three  files  and  included  in  this 
section  as  Graph  1,  2  and  3. 

The  response  time  of  a  query  was  calculated  from  2  terms, 
the  first  term  being  the  -^ime  required  by  BOOL_PROCESSOR  to 
access  the  rela-^ion,  and  the  second  term  being  the  time  required 
to  access  all  the  inverted  lis-^s  needed  for  the  query.  Because 
the  time  to  traverse  the  Boolean  tree,  the  time  to  evaluate  the 
conditions  and  the  time  to  merge  inverted  lists  are  at  least 
three  orders  of  magnitud^^  smaller  than  the  time  to  access 
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secondary  storage,  they  were  not 
response  time.  For  the  experiine 
inverted  lists  was  made  equal  to 
response  time  thus  becomes: 
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T2  =  (N*k/b)*(h  -i-c  ^  h/d) 


T1  h  c  +  r/d 

T2  Nk/b (h+c+b/d) 

if  r  =  b  then  T1  1 

T2  Nk/b 

A  normalized  value  for  the  response  time 
becomes; 

Normalized  Pesporse  Time  =  K1  +  K2*  (N*k)  /  b,  (i.e.,  Time/TI) 

For  this  experiment  b  =  100  bytes  /  record 

k  =  2  bytes  per  index 

In  the  graphs  storage  versus  response  time  and  %  of  domains 
inverted  versus  response  time  each  point  is  the  response  time 
when  a  particular  combination  of  domains  is  inverted.  The  total 
number  of  such  combinations  for  a  file  of  k  domains  is: 

c:(k,0)  +  C(k,1)  +  r{k,2)  +  ...  +  C(k,k-1)  +  C(k,k) 

For  a  graph  of  7  domains,  there  are  123  such  points.  Pseudo 
horizontal  lines  appear  on  all  fS  graphs;  these  points  represent 
the  average  response  time  to  evaluate  the  set  of  queries  when  the 
number  of  domains  inverted  is  the  same.  For  example,  whan  a 
relation  has  k  domains,  there  are  C(k,r)  different  combinations 
of  r  domains.  The  length  of  these  horizontal  lines  is  equal  to 
the  difference  in  the  average  response  time  when  the  best 
combination  of  r  domains  is  used  and  when  the  worst  combination 
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of  r  domains  is  used.  Graph  1  demonstrates  this  point  very 
clearly. 

Graphs  1,  2  and  3  all  emphasiTie  the  importance  for  the  data 
base  administrator  to  select  the  best  combination.  If  he  fails 
in  that  respect,  the  average  response  time  of  a  set  of  queries 
can  be  increased  by  a  factor  of  4  to  5. 

The  curve  plotted  in  graph  1,  2  and  3  is  formed  by  ioining 
the  points  whose  response  time  is  the  lowest  in  it*s  particular 
class.  In  a  file  of  V  domains,  there  are  y-t-2  such  points.  When 
the  performance  of  TPFF_'!=' VA,LUATE  is  discussed  for  r  of  k  domains 
inverted,  it  is  assumed  that  the  r  domains  selected  for  inversion 
are  those  giving  the  lowest  response  time.  This  set  of  r  domains 
can  be  called  the  best  r  combination. 

Graph  1  is  almost  a  replica  of  Fig  4.1,  the  results  obtained 
for  File  1  are  pGrf<=‘ct.  When  only  ^0%  of  the  domains  are 
inverted,  the  response  time  is  alread  low  and  at  60%,  it  is 
almost  equal  to  the  minimum  response  time  that  can  be  expected. 

The  results  of  File  2  are  not  as  encouraging;  nevertheless, 
one  can  see  from  Graph  2  that  when  more  than  20%  of  the  domains 
are  inverted,  the  slope  of  the  curve  increases  slightly, 
indicating  that  the  improvement  in  the  response  time  is  faster. 
At  the  F0%  point,  the  queries  can  be  evaluated  with  almost  15 
units  of  response  time  larger  than  when  all  the  domains  are 
inverted.  Graph  3  shows  results  similar  to  those  of  File  1;  60% 
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inversion  givps  almost  perfect  response  time.  In  all,  tha 
results  from  all  threp  experiments  do  indicate  that  even  if  a 
rplation  is  partially  inverted  at  a  60*^  to  80%  level,  the 
response  time  is  almost  at  its  minimum  thus  demonstrating  that 
query  evaluation  is  almost  as  fast  when  the  relation  is  partially 
inver^’-ed  as  when  it  is  totally  inverted. 

For  users  who  have  enough  secondary  storage  to  invert  on  all 
domains,  an  algorithm  that  evaluates  queries  on  a  partially 
inverted  relation  as  fast  as  on  a  totally  inverted  relation, 
provides  them  with  a  tool  to  upgrade  their  system  performance  by 
reducing  the  number  of  index  files  that  must  be  updated  in  order 
to  maintain  data  integrity.  For  users  who  have  both  a  response 
time  constraint  and  a  space  constraint,  algorithm  BOOL_PROCF SSOP 
is  not  only  an  esthetic  improvement  but,  possibly,  a  solution. 

The  graph  of  storage  versus  response  time  were  also  plotted 
for  the  three  experiments  and  are  included  in  this  section  as 
graph  and  6.  These  3  graphs  resemble  very  closely  graphs  1, 
2  and  3. 
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The  relations  used  for  testing  were  so  small,  that  it  was 
possible  to  evaluate  the  queries  for  all  the  possible 
combinations;  this  provided  us  with  the  knowledge  of  the  best  r 
combinations  for  r  =  1,2, 3, ...,k.  To  test  the  reliability  of  the 
query  analysis  procedure  of  Section  involved  only  trying  the 
procedure  against  each  set  of  queries  and  comparing  the  suggested 
best  r  combinations  against  the  real  best  r  combinations. 


The  results  of  these  experiments  are  analyzed  by  plotting  a 
curve  cost  versus  response  time  where  cost  is  the  average  cost 
for  the  sample  set  of  queries.  The  points  representing  the  real 
best  r  combinations  are  plotted  with  a  circle  and  the  points 
representing  the  suggested  best  r  combination  are  plotted  with  a 
triangle.  Whenever  a  triangle  is  not  plotted  inside  a  circle, 
the  procedure  failed  to  select  the  best  combination.  This 
failure  can  be  measured  by  taking  the  difference  between  the  two 
response  times. 


Graphs  7,  B  and  9  represent  the  plots  cost  versus  response 
time  for  File  1,  2  and  3  respectively.  Graph  7  shows  that  the 
query  analysis  procedure  was  1007>  correct  for  File  1.  Some 
discrepancies  did  happen  for  File  2  and  3;  the  procedure 
performed  poorest  for  File  3,  but  the  largest  effect  of  any 
discrepancy  was  roughly  7  units  of  response  time.  For  all  three 
graphs,  there  were  never  any  discrepancies  at  either  extremes  of 
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the  plot.  It  may  be  possible  to  remove  the  discr epancias  by 
changing  the  cost  function  of  the  NOT-nodes.  The  new  function 
couli  be  - (file  size)  when  the  leaf  connected  to  the  NOT-node  is 
an  inverted  leaf  and  t  (file  size)  for  a  non-inverted  leaf.  It 
must  be  noted  that  this  change  will  affect  both  the  query 
analysis  procedure  and  P00L_^50C^SS0R.  Experiments  were  not 
performed  with  this  modification  due  to  time  constraints. 

The  results  are  suprisingly  good.  It  can  only  be  concluded 
that  the  average  cost  of  a  set  of  queries,  as  calculated  here,  is 
almost  equal  to  the  average  response  time  of  those  queries.  Cost 
can  then  be  interchanaed  for  response  time.  Though  it  is 
admitted  that  the  graphs  storage  versus  response  time  and  %  of 
domains  inverted  versus  response  tira'=‘  are  too  costly  for  real 
life  applications,  whan  cost  is  interchanged  for  response  time, 
the  graphs  storage  versus  cost  and  the  ^  number  of  domains 
inverted  versus  cost  can  be  plotted.  The  results  of  these  graphs 
being  very  much  like  the  ones  of  graphs  1,  2,  3,  4,  5  and  6,  the 
data  base  administrator  now  has  a  method  that  tells  him  exactly 
how  many  domains  are  to  be  inverted  in  order  to  meet  the  space 
and  response  time  constraints;  at  th‘=‘  same  time,  the  method 
informs  him  of  the  domains  to  be  inverted. 
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5.0  Conclusion 


P^n  offecien^.  anl  minimal  algorithm  that  evaluates  queries  on 
partially  inverted  relations  was  explored.  ZVlso,  it  was  noted 
that  the  selection  of  the  domains  to  be  used  as  indices  would 
greatly  affect  the  performance  of  such  an  algorithm.  On 
investigating  the  available  methods  for  optimizing  the  selection 
of  indices,  it  was  found  that  those  methods  are  important  only  as 
a  formalized  treatment  of  the  problem  and  they  can  not  ba 
considered  as  working  ra-^thods  for  actually  selecting  secondary 
keys  (i.e.  Lum[6]  and  King[5]),  This  led  to  the  development  of  a 
query  analysis  procedure  as  described  in  Section  4,1, 

The  experimental  results  for  both  algorithm  B00L_PR0CR SSOP 
and  the  query  analysis  procedure  were  very  rewarding.  These 
results  suggest  the  following.  First,  query  analysis  procedures 
should  be  developed  directly  from  the  query  evaluation  algorithm. 
Secondly,  the  cost  function  defined  in  section  3  is  almost  equal 
to  the  normalized  response  time  of  a  query.  As  a  result,  the 
data  base  administrator  can  decide  what  fraction  of  the  domains 
should  be  inverted  by  plotting  storage  versus  cost  and  the 
percentage  number  of  domains  inverted  versus  cost.  The  ability 
to  interchange  cost  for  response  time  is  a  useful  asset. 

Still  to  be  considered  is  the  affect  of  allowing  updates, 
insertions,  and  deletions  in  the  data  base  model.  Further 
research  can  be  done  on  this  problem,  nevertheless,  one  must 
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remember  that  partial  inversion  of  a  relation  already 
the  maintenance  of  index  files  in  an  update  system  and 
does  touch  upon  the  problem. 


minimizes 
as  such. 
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Appendix  E;  Examples  of  Query  Evaluation 


The  same  relation  and  associated  inverted  files  are 
explain  some  of  the  concepts  presented  in  Chapter  3 
relation  consists  of  4  domains  and  a  total  of  10  tuples (i 
10).  The  relation  is  called: 

PARTS^SUPPLIER  (NAME, NDMBER, SUPPLIES , QUANTITY) 

It  is  assumed  that  the  relatior.  is  normalized. 


Relation:  PARTS  SUPPLIER 


NAME 


NUMBER  SUPPLIES 


1  A  2 

E  3 

A  2 

D  1 

E  3 

C  2 

A  2 

D  3 

E  2 

10  C  1 


a 

fc 

c 

d 

d 

a 

c 

d 

h 

a 


Inverted  Files 


QUANTITY 


1 

3 

2 

2 

1 

3 

1 

2 

3 

1 


NAME 

List 

A 

3, 

7 

1  . 

B 

2, 

5, 

9 

C 

6, 

10 

C 

4, 

8 

NUMBER 

1  List 

1 

1 

1  4,  10 

2 

1 

11,  3,  6,  7,  9 

3 

I 

1  2 ,  5 ,  8 

used  to 
The 
.e.  m  = 


E 


1 


S U P PL IE S  _1_L i s t 


2aA^ITY_I_List 


a 

1 

I  1, 

I 

6, 

10 

1 

1 

1  U  5, 

7,  10 

b 

I 

\  2, 

I 

9 

2 

1 

1  3,  4, 

1 

8 

c 

I  I 

I  I 

I  rr)  I 

I _ I 

I  I 

I  I 

I 

7 

3 

1 

1  2 ,  6, 

9 

d 

I 

I  4, 

5, 

8 

The  retrieval  expressions  resulting  after  the  comparators 
are  substituted  by  an  '=*  are  given. 

Included  in  this  appendix  are  a  series  of  examples 
demonstrating  how  the  cost  function  is  evaluated  and  how 
retrieval  expressions  are  evaluated  by  TREE_EVAIUATE. 

In  order  that  the  examples  be  well  understood,  some 
explanations  are  needed. 

•  All  pointers  are  explicitly  represented  by  arrows. 

•  AND-nodes,  OR-nodes,  and  NOT-nodes  are  represented  as 

fellows : 

r‘T  1  1  r”T  T”‘i  1  T  I 

I  OF]  AND  lOE  )  I  OF]  OR  |0F1  1  NOT  |  OF] 

_ 1 _ I  l_«l _ I _ I I 

OF  is  a  combination  of  I  /  I_LOG,  5  /  E_LCG,  and  T  /  T_LOG. 


E 


2 


when  the  domain  of  a  condition  is  inverted,  the  condition 


is  replaced  by  the  corresponding  inverted  list, 
inverted  leaf  is  as  fellows: 

i.e.  NAME  =  A 


i  3  i  1  1 - 


1 
1 
1 
1 

TUPLES 


^  1_3_  I 

1 

LIST_INDEX  I 


LEF 


'T' 


1 

I 

I 

Inverted  list  for  NAME  =  A 


•  When  the  domain  is  not  inverted  the  leaf  is  as  follows 


1 

CGN  DC MAIN 


I 

4^  ^ 

1  I 
i  I 

I  LAST  READ 

I 

CON  VALUE 


•  The  cost  of  each  node  is  placed  along  the  branch. 


For  all  the  examples,  domains  NAME,  NUMBER,  SUPPLIER 
inverted;  QUANTITY  is  not  inverted. 

Examples  of  the  Cost  Function 


E  -  3 


The 


are 


1.  NABE  =  A  'AND*  QUANTITY  =  2 


3 


Cost  =  3;  a  inaximuin  of  3  records  will  have  to  be  read. 

2.  NDBEEP  =  1  ’OR*  SUPPLIER  =  c 


Cost  =  -4;  4  pointers  of  2  inverted  lists  must  be  considered. 


E 


4 


3.  (NAME  =  A  *AND»  NUMBER 


1)  *0E»  (SUPPLIER  =  a  'AND*  QUANTITY  =  3) 


2L 


Cost  =  St;  A  records  will  be  read  frcm  secondary  storage- 


Examples  of  TREE  EVALU^^TE 


For  these 
the  node  will  be 
node. 


examples,  the  tuple  numbers  that  are  returned  to 
indicated  on  the  right  and  on  the  left  of  each 


E 


1.  NAME 


A  ’AND*  gUANTITY  =  2 


3 


Calls  made  by  the  DRIVER 


N 

lUPLE/LOGIC 

TUPLES  READ 

1 

1F 

1 

9 

3T 

3 

4 

7F 

7 

8 

-  1 

NUMBER  =  1 

*OR*  SUPPLIER  =  c 

-4- 


3T,Tr,  -  1 


T? 


6 


Calls  ty  the  DBIVEF 


N  IDPLE/LOGIC 


1 

4 

5 
8 


3T 

4T 

7T 

101 


TUPLES  RE^.D 


3.  NAME  =  B  *AND*  (SUPPLIER  =  a  • CB »  QUANTITY  =  3) 


3 


Calls  by  the  D5IVER 


N 


TUPLE/LOGIC 


TUPLES  REAL 


1 

3 

6 

10 


2T 

5F 

9T 

-1 


2 

5 

9 


The  ANP-node  has  ccntrclled  the  evaluation 


by  folloving  the  cheapest  branch,  cnly  three  tuples  ware  read 


E 


7 


y.  NAME  =  A  *0R»  »NOT*  {NDMEEE  =  2) 

-\o 


*  r, 

3T,  &  T,  <?r 


Calls  made  by  the  DRIVER 


N  TUPLE/ICGIC 


1 

2 

3 

4 

5 

6 

7 

8 

9 

10 


IT 

21 

3T 

4T 

5T 

6F 

7T 

8T 

9F 

10T 


TUPLES  READ 


The  NOT-node  can  cause  considerably  more  work 
expected.  The  reason  being  that  the  NOT-node  does  not 
tuple  numbers  at  all. 


End  of  Appendix  E 
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CSRG-15  PROCESS  STRUCTURING 

J.J,  Horning  and  B.  Randell,  June  1972 
[ACM  Computing  Surveys,  March  1973] 

CSRG-16  OPTIMAL  PROCESSOR  SCHEDULING  WHEN  SERVICE  TIMES  ARE 

HYPEREXPONENTIALLY  DISTRIBUTED  AND  PREEMTION  OVERHEAD 

IS  NOT  NEGLIGIBLE 

Kenneth  C.  Savcik,  June  1972 

[Proceedings  of  the  Symposium  on  Computer-Communication, 
Networks  and  Teletraffic, 

Polytechnic  Institute  of  Brooklyn,  1972] 


CSRG-17  PROGRAMMING  LANGUAGE  TRANSLATION 
W.M.  McKeeraan,  July  1972 


TECHNIQUES 


CSRG-18  A  COMPARATIVE  ANALYSIS  OF  SEVERAL  DISK  SCHEDULING 
ALGORITHMS 

C.J.H.  Turnbull,  September  1972 


CSFG-19  PROJECT  SUE  AS  A  LEARNING  EXPERIENCE 
K.C.  Sevcik  et  al,  September  1972 

[Proceedings  AFIPS  Fall  Joint  Computer  Conference, 
V.  41,  December  1972] 

CSRG-20  A  STUDY  OF  LANGUAGE  DIRECTED  COMPUTER  DESIGN 
David  B.  Wortman,  December  1972 
[Ph.D.  Thesis,  Computer  Science  Department, 
Stanford  University,  1972] 

CSRG-21  AN  APL  TERMINAL  APPROACH  TO  COMPUTER  MAPPING 
R.  Kvaternik,  December  1972 
[M.Sc.  Thesis,  DCS  1972  ] 


♦  CSRG-22  AN  IMPLEMENTATION  LANGUAGE  FOR  MINICOMPUTERS 
G.G.  Kalmar,  January  1973 
[M.Sc.  Thesis,  DCS  1972  ] 


CSRG-23  COMPILER  STRUCTURE 

W.M.  McKeeman,  January  1973 

[Proceedings  of  the  rrsA-Japan  Computer  Conference,  1972  ] 


♦  CSRG-24  AN  ANNOTATED  BIBLIOGRAPHY  ON  COMPUTER  PROGRAM 
ENGINEERING 

J.D.  Gannon  (ed.),  March  1973 


CSFG-25  THE  INVESTIGATION  OF  SERVICE  TIME  DISTRIBUTIONS 
Eleanor  A.  Lester,  April  1973 
[M.Sc.  Thesis,  DCS  1973  ] 

♦  CSRG-26  PSYCHOLOGICAL  COMPLEXITY  OF  COMPUTER  PROGRAMS: 

AN  INITIAL  EXPERIMENT 
Larry  Weissman,  August  1973 

♦  CSRG-27  STRUCTURED  SUBSETS  OF  THE  PL/I  LANGUAGE 

Richard  C.  Holt  and  David  B.  Wortman,  October  1973 

♦  CSRG-28  ON  THE  REDUCED  MATRIX  REPRESENTATION  OF  LR<k) 

PARSER  TABLES 

Marc  Louis  Joliat,  October  1973 
[Ph.D.  Thesis,  EE  1973] 

♦  CSRG-29  A  STUDENT  PROJECT  FOR  AN  OPERATING  SYSTEMS  COURSE 

B.  Czarnik  and  D.  Tsichritzis  (eds. )#  November  1973 


♦  CSRG-30  A  PSEUDO-MACHINE  FOR  CODE  GENERATION 
Henry  John  Pasko,  December  1973 
[M.Sc.  Thesis,  DCS  1973  ] 

CSRG-31  AN  ANNOTATED  BIBLIOGRAPHY  ON  COMPUTER  PROGRAM 
ENGINEERING 

J.D.  Gannon  (ed.)#  Second  Edition,  March  1974 


CSEG-32  SCHEDULING  MULTIPLE  RESOURCE  COMPUTER  SYSTEMS 

E, D.  Lazowska,  Hay  1974 
[M.Sc.  Thesis,  DCS  1974] 

♦  CSRG-33  AN  EDUCATIONAL  DATA  BASE  MANAGEMENT  SYSTEM 

F.  Lochovsky  and  D.  Tsichritzis,  May  1974 


*  CSRG-34  ALLOCATING  STORAGE  IN  HIERARCHICAL  DATA  BASES 

P.  Bernstein  and  D.  Tsichritzis,  May  1974 

*  CSRG-35  ON  IMPLEMENTATION  OF  RELATIONS 

D.  Tsichritzis,  May  1974 

CSRG-36  SIX  PL/I  COMPILERS 

D.B.  Wortman,  P.J.  Khaiat,  and  D, M.  Lasker 
August  1974 


CSRG-37  A  METHODOLOGY  FOR  STUDYING  THE  PSYCHOLOGICAL  COMPLEXITY 
OF  COMPUTER  PROGRAMS 
Laurence  M.  Weissman,  August  1974 
[Ph.D.  Thesis,  DCS  1974] 

♦  CSRG-38  AN  INVESTIGATION  OF  A  NEW  METHOD  OF  CONSTRUCTING 
SOFTWARE 

David  M.  Lasker,  September  1974 
[M.Sc.  Thesis,  DCS  1974  ] 

CSRG-39  AN  ALGEBRAIC  MODEL  FOR  STRING  PATTERNS 
Glenn  F.  Stewart,  September  1974 
[M.Sc.  Thesis,  DCS,  1974] 
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•40  EDUCATIONAL  DATA  BASE  SYSTEM  USER’S  MANUAL 
J.  Klebanoff,  F.  Lochovsky,  A,  Rozitis,  and 

D.  Tsichritzis,  September  1974 

41  NOTES  FROM  A  WORKSHOP  ON  THE  ATTAINMENT  OF 
RELIABLE  SOFTWARE 

David  B.  Wortman  (ed. )  ^  September  1974 

42  THE  PROJECT  SUE  SYSTEM  LANGUAGE  REFERENCE  MANUAL 
B.L.  Clark  and  F.J.B.  Ham,  September  1974 

43  A  DATA  BASE  PROCESSOR 

E.  A,  Ozkarahan,  S. A.  Schuster  and  K.C.  Smith, 

November  1974 

44  MATCHING  PROGRAM  AND  DATA  REPRESENTATION  TO  A 
COMPUTING  ENVIRONMENT 

Eric  C.R,  Hehner,  November  1974 
[Ph.D.  Thesis,  DCS,  1974] 

45  THREE  APPROACHES  TO  RELIABLE  SOFTWARE;  LANGUAGE 
DESIGN,  DYADIC  SPECIFICATION,  COMPLEMENTARY  SEMANTICS 
J.E,  Donahue,  J.D.  Gannon,  J.V,  Guttaq  and 

J.J.  Horning,  December  1974 

46  THE  SYNTHESIS  OF  OPTIMAL  DECISION  TREES  FROM 
DECISION  TABLES 

Helmut  Schumacher,  December  1974 
[H.Sc.  Thesis,  DCS,  1974] 

47  LANGUAGE  DESIGN  TO  ENHANCE  PROGRAMMING  RELIABILITY 
John  D.  Gannon,  January  1975 

[Ph.D.  Thesis,  DCS,  1975] 

48  DETERMINISTIC  LEFT  TO  RIGHT  PARSING 
Christopher  J.M.  Turnbull,  January  1975 
[Ph.D.  Thesis,  EE,  1974] 

49  A  NETWORK  FRAMEWORK  FOR  RELATIONAL  IMPLEMENTATION 
D.  Tsichritzis,  February  1975 

50  A  UNIFIED  APPROACH  TO  FUNCTIONAL  DEPENDENCIES 
AND  RELATIONS 

P.A.  Bernstein,  J.R.  Swenson  and  D.C.  Tsichritzis 
February  1975 

51  ZETA:  A  PROTOTYPE  RELATIONAL  DATA  BASE 
MANAGEMENT  SYSTEM 

M.  Brodie  (ed) .  February  1975 

52  AUTOMATIC  GENERATION  OF  SYNTAX-REPAIRING  AND 
PARAGRAPHING  PARSERS 
David  T.  Barnard,  March  1975 
[M.Sc.  Thesis,  DCS,  1975] 

QUERY  EXECUTION  AND  INDEX  SELECTION  FOR  RELATIONAL 
DATA  BASES 

J.H.  Gilles  Farley  and  Stewart  A.  Schuster , March  1975 
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